Embedded ML Intro

Architecture, Toolchains, and Hardware Constraints

A High-level Map for Engineers Deploying ML on Microcontrollers

Most machine learning runs in data centers, on racks of GPUs with terabytes of memory. That's not what we're talking about here. We're talking about models that run on devices smaller than a postage stamp, drawing mere milliwatts of power.

Deploying neural networks on microcontrollers isn't a matter of shrinking models. It requires a fundamentally different approach to system design, memory management, and software architecture. This guide covers the system model, toolchain landscape, hardware tiers, and the practical considerations that separate prototype demos from reliable production deployments.

1. The System Model: From Neural Networks to Static Graphs

Redefining What a "Model" Means

On a cloud server, a neural network is a dynamic computational graph. Layers can be added, weights updated, batch sizes varied at runtime. On an MCU, this flexibility is a liability. Embedded ML reframes the model as a static Directed Acyclic Graph (DAG) of signal processing blocks. Every tensor shape, every memory access pattern, every arithmetic operation is fixed at compile time.

This shift enables aggressive compiler optimizations that are impossible in dynamic runtimes. When the compiler knows every tensor dimension ahead of time, it can eliminate bounds checking, pre-compute memory offsets, and schedule operations to maximize cache-line utilization.

Execution Strategy: Determinism Over Throughput

General-purpose ML frameworks optimize for throughput: process as many samples per second as possible, allocate memory dynamically, let the OS handle scheduling. Embedded systems invert these priorities.

Consider a keyword spotting model. Audio arrives continuously, and the system processes it in fixed-size windows (say, 30ms chunks). If inference occasionally takes 200ms instead of 50ms, the audio buffer fills up and overwrites unprocessed samples. The user says "Hey Device" and nothing happens. A motor control inference that varies by ±10ms may cause mechanical instability.

The embedded execution strategy optimizes for deterministic latency and static memory usage. Memory is allocated once at startup and never freed. Execution time is bounded and predictable. This determinism is what allows ML to operate within hard real-time control loops.
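A minimal sketch of that windowed pattern, assuming an audio DMA or ISR delivers fixed-size windows into a ping-pong buffer (on_window_captured, run_inference, and the buffer sizes are illustrative placeholders, not any particular vendor API):

#include <cstdint>

constexpr uint32_t kWindowSamples = 480;    // e.g. 30ms of 16kHz audio

static int16_t buffers[2][kWindowSamples];  // ping-pong: capture one half, process the other
static volatile uint8_t ready_idx = 0;
static volatile bool window_ready = false;

// Called from the audio DMA/ISR when one buffer has been filled.
void on_window_captured(uint8_t idx) {
    ready_idx = idx;
    window_ready = true;
}

void run_inference(const int16_t* samples, uint32_t n);  // placeholder ML stage

void main_loop() {
    for (;;) {
        if (window_ready) {
            window_ready = false;
            // Must complete before the other buffer fills, or samples are lost.
            run_inference(buffers[ready_idx], kWindowSamples);
        }
    }
}

The only thing that makes this loop safe is the bounded, predictable inference time described above.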

Memory Management: The Tensor Arena

Perhaps the most counterintuitive aspect of embedded ML is its approach to memory. Desktop applications call malloc() and free() freely, trusting the heap to manage fragmentation. This pattern is catastrophic on MCUs. Heap fragmentation on a system with 256KB of SRAM can render memory unusable within hours of operation.

The solution is the Tensor Arena: a single contiguous block of SRAM allocated at startup that serves as working memory for all inference operations. The runtime never calls malloc during inference. Instead, it manages buffer allocation within this fixed arena using compile-time scheduling.

The key optimization is buffer reuse through activation overlaying. Consider a three-layer network where Layer A produces activations consumed by Layer B, which then produces activations for Layer C. Once Layer B completes, Layer A's output buffer is no longer needed. The runtime reuses that memory for Layer C's output, dramatically reducing peak memory footprint.
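A toy allocation plan for that three-layer chain (buffer sizes are hypothetical) makes the saving concrete:

Buffer      Size    Live during          Arena offset
A output    16 KB   Layer A → Layer B    0 KB
B output     8 KB   Layer B → Layer C    16 KB
C output     4 KB   Layer C → output     0 KB   (reuses A's region)

Peak arena usage is 24 KB instead of the 28 KB needed if every buffer kept its own slot.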

The compiler analyzes the entire computation graph, determines tensor lifetimes, and emits a static allocation plan that minimizes arena size while respecting all dependencies. Getting the arena size right is critical. Too small and inference fails; too large and you're wasting precious SRAM that could be used for application logic.
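To make the arena concrete, here is a minimal sketch of how a TensorFlow Lite Micro deployment (covered in the next section) declares and sizes it, assuming a recent TFLM C++ API and a model exported as a C array; g_model_data and the 20 KB figure are placeholders:

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// One static block of SRAM for all activations and scratch buffers.
constexpr int kArenaSize = 20 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

extern const unsigned char g_model_data[];   // model weights compiled into Flash

void inference_init() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators the model actually uses to keep Flash small.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  static tflite::MicroInterpreter interpreter(model, resolver,
                                              tensor_arena, kArenaSize);
  interpreter.AllocateTensors();   // the static allocation plan is laid out here, once
}

In practice the arena is sized empirically: start generous, then shrink toward the value reported by interpreter.arena_used_bytes() plus a safety margin.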

2. The Software Landscape and Toolchains

Training Frameworks: Where Models Are Born

Model development begins in high-level Python frameworks. PyTorch and TensorFlow dominate here. These environments prioritize researcher productivity: automatic differentiation, dynamic computation graphs, rich debugging tools. The trained model (a set of learned weights and a graph definition) must then be exported for embedded deployment.

The standard interchange format is ONNX (Open Neural Network Exchange). ONNX captures the model's computation graph in a framework-agnostic way. From ONNX, specialized compilers can target specific hardware platforms, applying optimizations appropriate for the target's memory hierarchy and instruction set.

Runtime Architectures: Interpreters vs. AOT Compilers

The choice of runtime architecture represents a fundamental tradeoff between flexibility and efficiency.

Interpreter-Based Runtimes

Runtimes like TensorFlow Lite Micro operate as interpreters: the MCU runs a generic inference engine that parses a model file at runtime.

┌──────────────────────────────────────────────────┐
│                    MCU Flash                      │
├──────────────────────┬───────────────────────────┤
│ TFLite Micro         │ Model File                │
│ Interpreter          │ (FlatBuffer)              │
│ (~50-100KB)          │ (variable)                │
└──────────────────────┴───────────────────────────┘
           │                         │
           └────────────┬────────────┘
               ┌─────────────────┐
               │ Parse & Execute │  ← (at runtime)
               └─────────────────┘

This architecture has compelling advantages. Model updates can be delivered over-the-air simply by replacing a binary blob, no firmware reflashing required. This is invaluable for products that need to improve detection accuracy or adapt to new use cases post-deployment.

The cost is overhead. The interpreter itself consumes Flash (often 50-100KB), and model parsing introduces latency at startup. For severely constrained devices or applications requiring minimal boot time, this overhead may be unacceptable.

Ahead-of-Time (AOT) Compilers

AOT compilers take a different approach: the model is transpiled directly into C/C++ source code with hard-coded loops, memory offsets, and optimized kernel calls. The resulting code is compiled alongside your firmware as native machine code.

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ model.onnx  │ ──→ │ AOT Compiler │ ──→ │  model.c    │
└─────────────┘     │ (TVM, EON,   │     │  model.h    │
                    │  emlearn...) │     └─────────────┘
                    └──────────────┘            │
                                      ┌───────────────────┐
                                      │  Firmware Binary  │
                                      │ (model baked in)  │
                                      └───────────────────┘

Options in this space include TVM, Edge Impulse's EON compiler, Glow, and emlearn (which is particularly strong for classical ML models like Random Forests on very constrained devices).

The benefits are significant. Flash usage drops dramatically since there's no interpreter overhead. Execution is faster because there's no parsing or dispatch logic. Memory access patterns are fully determined at compile time, enabling aggressive optimization.
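To make "baked in" concrete, the emitted source typically has roughly the shape sketched below for a single quantized dense layer. This is a hypothetical illustration, not the output of any particular compiler; real emitters differ in naming, requantization math, and structure:

/* model.h -- fixed interface, all shapes known at compile time */
#include <stdint.h>
#define MODEL_INPUT_LEN   490
#define MODEL_OUTPUT_LEN  4
const int8_t* model_invoke(const int8_t* input);

/* model.c -- weights become const arrays linked into Flash,
   layers become plain loops with hard-coded bounds */
static const int8_t  dense0_weights[MODEL_OUTPUT_LEN * MODEL_INPUT_LEN] = { 0 /* ... */ };
static const int32_t dense0_bias[MODEL_OUTPUT_LEN]                      = { 0 /* ... */ };
static int8_t        output[MODEL_OUTPUT_LEN];   /* statically sized activation buffer */

const int8_t* model_invoke(const int8_t* input) {
    for (int o = 0; o < MODEL_OUTPUT_LEN; ++o) {            /* fixed loop bounds */
        int32_t acc = dense0_bias[o];
        for (int i = 0; i < MODEL_INPUT_LEN; ++i) {
            acc += (int32_t)input[i] * dense0_weights[o * MODEL_INPUT_LEN + i];
        }
        output[o] = (int8_t)(acc >> 8);                     /* placeholder requantization */
    }
    return output;                                          /* nothing parsed, nothing dispatched */
}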

The tradeoff is rigidity. The model is baked into the firmware binary. Updating the model means rebuilding and reflashing the entire firmware image. For products with stable model requirements, this is acceptable. For products expecting frequent model updates, interpreter-based runtimes may be preferable despite their overhead.

PyTorch on Embedded: The Evolution

PyTorch's embedded story has evolved significantly. The legacy approach, PyTorch Mobile, relied on TorchScript (a subset of Python that could be JIT-compiled). While functional, the TorchScript runtime was too heavy for Cortex-M class devices.

ExecuTorch represents PyTorch's modern embedded runtime. It eliminates the TorchScript JIT dependency, offering a dramatically smaller footprint suitable for microcontrollers. ExecuTorch maintains PyTorch's operator semantics while providing the deterministic execution guarantees embedded systems require.

End-to-End Platforms: Edge Impulse

Platforms like Edge Impulse abstract away much of the toolchain complexity, providing an integrated pipeline from data collection through DSP preprocessing, model training, and deployment. Under the hood, Edge Impulse wraps toolchains like its EON compiler and TensorFlow Lite Micro, automatically selecting appropriate optimizations for the target hardware.

These platforms significantly accelerate prototyping. Getting a working keyword detector from audio samples to embedded device can take hours rather than weeks. However, production deployments often require deeper customization than these platforms expose. Understanding the underlying toolchains enables engineers to push beyond platform limitations when application requirements demand it.

3. Hardware Constraints and Feasibility Analysis

Every embedded ML feasibility analysis begins with three questions:

Flash: Does the model's read-only data (weights, biases, architecture definition) fit in non-volatile storage? A quantized MobileNet might require 1-4MB; a tiny keyword spotting model might fit in 50KB.

SRAM: Can the device hold the input buffer, output buffer, and largest intermediate activation simultaneously? This "high water mark" often determines feasibility more than total model size. A seemingly small model with a wide intermediate layer can have prohibitive SRAM requirements.

Latency: Can the model execute within the application's timing budget? Latency is primarily determined by total multiply-accumulate (MAC) operations divided by effective throughput. A 1M MAC model on a 100MHz Cortex-M4 will take roughly 50-100ms depending on memory access patterns and DSP utilization.
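As a back-of-the-envelope check of that figure, assuming roughly 5-10 cycles per MAC for an int8 kernel that isn't fully exploiting the DSP instructions:

latency ≈ (MACs × cycles per MAC) / clock rate
        ≈ (1,000,000 × 5 to 10) / 100,000,000 cycles per second
        ≈ 50 to 100 ms

Tuned kernels that keep the SIMD multiply-accumulate pipeline busy push cycles-per-MAC down, which is why memory access patterns and DSP utilization dominate the spread.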

Quantization: The Affine Transform

Quantization is the single most impactful optimization for embedded deployment. The core idea is mapping 32-bit floating-point values to 8-bit integers using an affine transformation:

r = S × (q - Z)

where r is the real value, q is the quantized integer, S is the scale factor, and Z is the zero point.
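A minimal sketch of both directions of this mapping, using the same symbols; production runtimes fold these steps into their integer kernels rather than calling helpers like these:

#include <algorithm>
#include <cmath>
#include <cstdint>

// r = S * (q - Z): recover the approximate real value from a quantized int8.
float dequantize(int8_t q, float S, int32_t Z) {
    return S * (static_cast<int32_t>(q) - Z);
}

// Inverse: q = round(r / S) + Z, clamped to the int8 range.
int8_t quantize(float r, float S, int32_t Z) {
    const int32_t q = static_cast<int32_t>(std::lround(r / S)) + Z;
    return static_cast<int8_t>(std::clamp(q, -128, 127));
}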

The benefits compound. Memory bandwidth drops 4x because you're moving 8-bit values instead of 32-bit floats. Storage requirements drop 4x. And critically, inference can use integer arithmetic units and DSP SIMD instructions rather than floating-point units (which many MCUs lack entirely). Real-world speedups of 4-10x are common.

Post-training quantization works well for many architectures, but some models require quantization-aware training to maintain accuracy. This is particularly true for models with narrow layers or unusual activation patterns.

Memory Layout: The NCHW vs. NHWC Problem

A subtle but critical consideration is tensor memory layout. PyTorch defaults to NCHW format (Batch, Channels, Height, Width), where each channel's spatial plane is stored contiguously. ARM's CMSIS-NN library and TensorFlow Lite optimize for NHWC format (Batch, Height, Width, Channels), where all channels for a given spatial position are stored contiguously.
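The difference is easiest to see in the flat-buffer index each layout implies for element (n, c, h, w) of an N×C×H×W tensor (small illustrative helpers, not any library's API):

#include <cstddef>

// NCHW: w varies fastest, so each channel's H*W plane is one contiguous run.
inline size_t idx_nchw(size_t n, size_t c, size_t h, size_t w,
                       size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// NHWC: c varies fastest, so all channels of one pixel sit next to each other.
inline size_t idx_nhwc(size_t n, size_t c, size_t h, size_t w,
                       size_t C, size_t H, size_t W) {
    return ((n * H + h) * W + w) * C + c;
}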

When formats mismatch, the runtime must transpose tensors. This memory-bound operation can consume significant cycles on every layer. A model trained in PyTorch and deployed via TFLite may spend more time transposing data than computing convolutions.

The solution is designing with deployment in mind from the start. Train in the layout your target runtime expects, or ensure your toolchain handles conversion without runtime overhead.

4. Hardware Tier Mapping

Not all microcontrollers are created equal. Understanding which workloads are feasible on which hardware tiers prevents wasted engineering effort and ensures appropriate platform selection. The tier names below are my own taxonomy for thinking about capability classes.

Nano Tier: Cortex-M0+ / M3

These ultra-low-power cores lack DSP extensions and often have limited SRAM (8-64KB). Neural network inference is generally not viable. Even a tiny 8KB model would dominate available memory, and without SIMD instructions, inference latency becomes prohibitive.

However, these devices excel at classical statistical analysis: mean/variance calculations, threshold detection, and lightweight decision tree models. Tools like emlearn can deploy Random Forests and small MLPs that fit comfortably on these devices. Many sensor preprocessing tasks are well-served by this tier at minimal power cost.
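As an illustration of what fits comfortably at this tier, here is a sketch of a hand-rolled RMS/threshold check of the kind described above (the function name and threshold are illustrative; fixed-point variants avoid software floating point on cores without an FPU):

#include <cmath>
#include <cstddef>
#include <cstdint>

// Flags a window of raw samples whose RMS deviation from the mean exceeds
// a calibrated threshold -- a few hundred bytes of code, no ML framework.
bool vibration_anomaly(const int16_t* samples, size_t n, float threshold) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i) mean += samples[i];
    mean /= static_cast<float>(n);

    float var = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float d = static_cast<float>(samples[i]) - mean;
        var += d * d;
    }
    var /= static_cast<float>(n);

    return std::sqrt(var) > threshold;
}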

Typical applications: Environmental monitoring, simple anomaly detection, battery-powered sensor nodes with multi-year lifetimes.

Micro Tier: Cortex-M4F / M33

The sweet spot for many embedded ML applications. These cores include DSP extensions with SIMD instructions (like ARM's SMLAD, which performs simultaneous multiply-accumulate on dual 16-bit values). With 128-512KB of SRAM and 1-2MB Flash, meaningful neural networks become feasible.
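To show what those DSP extensions buy, here is a sketch of an int16 dot product built on the __SMLAD intrinsic (two 16-bit multiply-accumulates per instruction). It assumes the CMSIS core header for your Cortex-M4/M33 device is already included, which is what provides __SMLAD:

#include <cstdint>
#include <cstring>

int32_t dot_q15(const int16_t* a, const int16_t* b, uint32_t n) {
    uint32_t acc = 0;
    // Each packed 32-bit word holds a pair of int16 values.
    for (uint32_t i = 0; i + 1 < n; i += 2) {
        uint32_t pa, pb;
        std::memcpy(&pa, &a[i], sizeof(pa));
        std::memcpy(&pb, &b[i], sizeof(pb));
        acc = __SMLAD(pa, pb, acc);      // acc += a[i]*b[i] + a[i+1]*b[i+1]
    }
    if (n & 1u) {
        acc += static_cast<int32_t>(a[n - 1]) * b[n - 1];   // odd tail element
    }
    return static_cast<int32_t>(acc);
}

Optimized libraries like CMSIS-NN emit this pattern (plus unrolling and careful loads) for you; the point is that quantized data is what unlocks it.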

Typical applications: Audio keyword spotting ("Hey Siri" style wake words), IMU-based gesture recognition, multi-sensor fusion for activity classification, vibration-based predictive maintenance.

Performance Tier: Cortex-M7 / M55 / M85

Higher clock speeds (400MHz+), larger caches, and more sophisticated pipelines enable more complex models. Low-resolution vision tasks become practical: 96×96 grayscale person detection, simple object classification, basic scene understanding.

The Cortex-M55 and M85 introduce Helium (M-Profile Vector Extension), ARM's SIMD architecture for embedded ML. Helium provides substantial acceleration for quantized inference, narrowing the gap between MCUs and dedicated accelerators for many workloads.

At this tier, memory bandwidth often becomes the bottleneck before compute does. Flash-to-cache transfer rates limit how quickly weights can be fed to the execution units. Careful attention to memory access patterns and model architecture can yield significant performance gains.

Typical applications: Person detection, simple image classification, audio scene classification, sophisticated sensor fusion.

Acceleration Tier: NPU / Ethos-U

For demanding workloads (real-time object detection, high-fidelity audio processing, complex vision pipelines), dedicated Neural Processing Units become necessary. ARM's Ethos-U family is designed to pair with Cortex-M CPUs, providing 10-100x acceleration for supported operations.

The NPU handles compute-intensive convolution and fully-connected layers (often 95% of total operations). The CPU handles "glue" logic: reshape operations, softmax normalization, and application-specific pre/post-processing. This division requires careful orchestration to avoid pipeline stalls.

Typical applications: Real-time object detection, multi-face recognition, voice command with large vocabulary, medical image analysis, autonomous navigation.

5. Production Considerations

The Supported Operator Constraint

Here's a hard lesson many teams learn late in development: hardware runtimes support a limited subset of operators compared to PyTorch or TensorFlow. A model that runs perfectly on your development machine may fail to compile for your target device because it uses an unsupported activation function, pooling variant, or layer type.

The professional approach is to design backwards from the target hardware's allow-list. Before architecting your model, examine which operators your deployment toolchain supports. Avoid HardSwish if your runtime only supports ReLU. Skip Squeeze-and-Excitation blocks if channel-wise operations aren't accelerated. This constraint-first design prevents painful late-stage rewrites.

DSP vs. ML: Choosing the Right Tool

Not every sensing problem requires machine learning. Many problems that appear to need "AI" are actually well-solved by classical digital signal processing.

Consider vibration analysis for predictive maintenance. An FFT to extract frequency components, followed by threshold detection on specific harmonics, may outperform a neural network while using a fraction of the resources. The analytical approach is also more interpretable: you can explain exactly why the system flagged an anomaly.

ML shines where analytical definitions fail: distinguishing between "normal" vibration patterns that vary in complex ways, detecting anomalies that don't correspond to known failure modes, or fusing multiple sensor modalities where the relationship isn't easily characterized mathematically.

The optimal architecture often combines both approaches. DSP handles feature extraction (FFT, RMS calculation, zero-crossing detection), followed by ML for the pattern recognition that's difficult to hand-code. This hybrid approach leverages the efficiency of DSP while applying ML only where it provides clear value.
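A sketch of that hybrid shape, assuming CMSIS-DSP is available on the target and some classifier entry point exists (model_predict is a hypothetical placeholder for whatever your ML stage exposes):

#include "arm_math.h"

#define FFT_LEN 256

static arm_rfft_fast_instance_f32 fft;
static float32_t spectrum[FFT_LEN];       // packed complex FFT output
static float32_t magnitude[FFT_LEN / 2];  // feature vector handed to the classifier

extern int model_predict(const float32_t* features, uint32_t n);  // hypothetical ML stage

void features_init(void) {
    arm_rfft_fast_init_f32(&fft, FFT_LEN);
}

// window: FFT_LEN time-domain samples (modified in place by the FFT).
int classify_window(float32_t* window) {
    arm_rfft_fast_f32(&fft, window, spectrum, 0);           // DSP: real FFT
    arm_cmplx_mag_f32(spectrum, magnitude, FFT_LEN / 2);    // DSP: per-bin magnitudes
    return model_predict(magnitude, FFT_LEN / 2);           // ML: pattern recognition
}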

Real-World Problem Domains

Embedded ML enables intelligent edge devices across numerous domains:

Predictive Maintenance: Analyzing vibration signatures, acoustic emissions, and current waveforms to detect bearing wear, motor imbalance, or impending failures before they cause downtime. Particularly valuable in industrial settings where sensor nodes must operate for years on battery power.

Audio Intelligence: Beyond keyword spotting, embedded ML enables speaker identification, sound event detection (glass breaking, dog barking, baby crying), and voice activity detection. These capabilities transform simple microphones into intelligent audio sensors.

Motion and Gesture: IMU-based activity classification (walking, running, cycling, fall detection), gesture recognition for device control, and fine-grained motion analysis for sports performance or physical therapy applications.

Visual Sensing: Person detection for occupancy sensing, object classification for inventory management, quality inspection in manufacturing, and wildlife monitoring in conservation applications.

Environmental Monitoring: Multi-sensor fusion for air quality assessment, anomaly detection in water treatment, and agricultural sensing for precision farming.

Moving Forward

Embedded ML represents a fundamental shift in how we build intelligent systems. The constraints are real: limited memory, fixed compute budgets, deterministic timing requirements. But so are the opportunities. Devices that once merely sensed can now understand. Products that previously uploaded raw data for cloud processing can now act on insights locally, preserving privacy and eliminating connectivity dependencies.

Success requires thinking differently about the entire development pipeline. Model architecture must be designed with deployment constraints in mind from day one. Toolchain selection must balance flexibility against efficiency for your specific use case. Hardware selection must account for not just peak performance but power budget, cost at scale, and long-term availability.

The field is evolving rapidly. New compilers squeeze more performance from existing silicon. Novel architectures like tiny transformers are being adapted for MCU deployment. Hardware vendors are adding ML acceleration to even entry-level microcontrollers. Staying current requires continuous learning and empirical experimentation.

If you're working on an embedded ML project and want to talk through architecture decisions, toolchain tradeoffs, or feasibility questions, feel free to reach out. We've shipped enough of these systems to have opinions.
